Data Cleanup for Building Permits for Seattle Department of Construction and Inspections¶

   
Author MG
Date Nov 28, 2023
Description This notebook documents the process of data cleaning of
  the dataset Building Permits offered by Seattle Open
  Data.

How to Read this Document¶

This notebook documents the cleaning process, so any results should be read in that context. For instance, some figures describe the initial data, which is corrected later. The actual analysis is described in a separate document. The section at the end tests certain corrections on the final dataframe.

Prepare: Data Source¶

The data is provided by the Seattle Department of Construction and Inspections of the City of Seattle and is available through Seattle Open Data as the Building Permits dataset.

Data Organization¶

The original data is contained in a single csv file.

Data Credibility¶

The data is provided by a government entity, and I assume it follows standards for collecting and publishing credible data.

Data Bias¶

It is reasonable to expect that the government agency that prepared the data, Seattle Department of Construction and Inspections, has taken measures to minimize any bias in the data.

License and Attribution¶

The data is available as a Public Domain dataset.

In [ ]:
# Rendering note: exporting via VS Code "Jupyter: Export to HTML" shows the
# figures properly, but also includes the input cells.
#
# To export externally and hide the code cells in the exported file, run:
# jupyter nbconvert --to html --no-input file.ipynb
import plotly.io as pio
pio.renderers.default='notebook'

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# for maps
import plotly.express as px
import plotly.graph_objects as go 
from plotly.subplots import make_subplots

# Instead of setting the cell to Markdown, create Markdown from within a code cell!
# We can use Python variable substitution (f-strings) to make the text dynamic
from IPython.display import Markdown as md

# for file manipulation
import os.path

import myutils as ut

# show all rows and columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# do not limit the column width
pd.set_option('display.max_colwidth', None)

# setup for seaborn plots
sns.set_style('darkgrid')
sns.set_palette('Set2')

# the style of the map
MAP_STYLE = "open-street-map"
#MAP_STYLE = "stamen-terrain"

# the path to the original data
DATA_PATH_ORG='input/Building_Permits_20231024_saved.zip'
# the path to the processed data
DATA_PATH_PROCESSED='input/Building_Permits_20231024_clean.zip'

# the path to the description of the fields
DESC_PATH_ORG='input/csv_descr.csv'
# temporary file to work with
DATA_PATH_TEMP='input/Building_Permits_20231024_temp.csv'

Process: Data Cleanup¶

The short description of the dataset is presented below. There are 25 columns and 168018 entries.

Pandas warns that columns 6 and 10 have mixed types.

I added the parameter thousands=',' to the pd.read_csv call; otherwise it complains that column 6 has mixed values.

To accommodate the mixed-types warning for column 10, I converted it explicitly to a date type. Since there are other date columns, I also specify them explicitly when reading the file. The results are presented below.
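A minimal sketch of what the helper ut.read_to_df may do under these assumptions (the actual implementation lives in myutils.py and may differ):

```python
import io
import pandas as pd

# Hypothetical sketch of ut.read_to_df: parse the thousands separators and
# the four date columns up front so that the dtypes are stable.
DATE_COLS = ['AppliedDate', 'IssuedDate', 'ExpiresDate', 'CompletedDate']

def read_permits(path_or_buf):
    return pd.read_csv(
        path_or_buf,
        thousands=',',          # "1,014" becomes 1014 instead of a string
        parse_dates=DATE_COLS,  # resolves the mixed-type warning on column 10
        low_memory=False,
    )

# tiny self-contained demo with the same column names
csv = (
    "PermitNum,HousingUnits,AppliedDate,IssuedDate,ExpiresDate,CompletedDate\n"
    'P1,"1,014",2021-12-08,2022-01-05,2023-01-05,2023-02-01\n'
)
demo = read_permits(io.StringIO(csv))
```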

In [ ]:
df_org = ut.read_to_df(DATA_PATH_ORG, False)
df_org.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168018 entries, 0 to 168017
Data columns (total 25 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   PermitNum              168018 non-null  object        
 1   PermitClass            161578 non-null  object        
 2   PermitClassMapped      161578 non-null  object        
 3   PermitTypeMapped       168018 non-null  object        
 4   PermitTypeDesc         157825 non-null  object        
 5   Description            167731 non-null  object        
 6   HousingUnits           26697 non-null   float64       
 7   HousingUnitsRemoved    79331 non-null   float64       
 8   HousingUnitsAdded      79331 non-null   float64       
 9   EstProjectCost         133261 non-null  float64       
 10  AppliedDate            129287 non-null  datetime64[ns]
 11  IssuedDate             122470 non-null  datetime64[ns]
 12  ExpiresDate            122626 non-null  datetime64[ns]
 13  CompletedDate          84417 non-null   datetime64[ns]
 14  StatusCurrent          168018 non-null  object        
 15  RelatedMup             13689 non-null   object        
 16  OriginalAddress1       167581 non-null  object        
 17  OriginalCity           161380 non-null  object        
 18  OriginalState          161470 non-null  object        
 19  OriginalZip            157323 non-null  object        
 20  ContractorCompanyName  28358 non-null   object        
 21  Link                   168018 non-null  object        
 22  Latitude               165229 non-null  float64       
 23  Longitude              165229 non-null  float64       
 24  Location1              165229 non-null  object        
dtypes: datetime64[ns](4), float64(6), object(15)
memory usage: 32.0+ MB

The description of attributes is presented below:

In [ ]:
df_desc_org = pd.read_csv(DESC_PATH_ORG, sep='|', header=None)
df_desc_org
Out[ ]:
0 1
0 PermitNum The tracking number used to refer to this permit in SDCI's tracking system.
1 PermitClass The permit class tells you the type of project.\t
2 PermitClassMapped A description of whether the permit is for a residential or non-residential project.
3 PermitTypeMapped The permit type by category, such as building, demolition, roofing, grading, and environmentally critical areas.
4 PermitTypeDesc Additional information about the type of permit. For example, whether it is an addition/alternation or a new project.
5 Description A brief description of the work that will be done under this permit. This description is subject to change before SDCI issues the permit. The description is generally more stable if we have issued the permit. Very long descriptions have been truncated.
6 HousingUnits The number of housing units included at the beginning of the project.
7 HousingUnitsRemoved The number of housing units removed during the project.
8 HousingUnitsAdded The number of housing units added during the project.
9 EstProjectCost The estimated project cost of the work being proposed is based on fair market value (parts plus labor). The estimated cost (if any) represents the best available information to date, and is subject to change if the project is modified. We do not collect the estimated project cost for all permit types.
10 AppliedDate The date SDCI accepted the application as a complete submittal.
11 IssuedDate The date SDCI issued the permit. If there is an Application Date but no Issue Date, this generally means the application is still under review.
12 ExpiresDate The date the application is due to expire. Generally, this is the date by which work is supposed to be completed (barring renewals or further extensions). If there is not an Expiration Date, this generally means the permit has not been issued.
13 CompletedDate The date the permit had all its inspections completed. If there is an Issue Date but not a Completed Date, this generally means the permit is still under inspection.
14 StatusCurrent The current status in the application/review/inspection life cycle. This status shows the most recent process step that was fully completed.
15 RelatedMup The land use permit that is related to this building permit, if there is one.
16 OriginalAddress1 The street name and number of the project.
17 OriginalCity The city for the project's address.
18 OriginalState The state for the project's address.
19 OriginalZip The Zip code for the project's address.
20 ContractorCompanyName The contractor(s) associated with this permit.
21 Link A link to view full details and current status information about this permit at SDCI's website.
22 Latitude Latitude of the worksite where permit activity occurs. May be missing for a small number of permits considered "unaddressable."
23 Longitude Longitude of the worksite where permit activity occurs. May be missing for a small number of permits considered "unaddressable."
24 Location1 The latitude and longitude location for mapping purposes.

Incorrect Data Types¶

Let's look at the data types, comparing them with the data type descriptions shown on the data source website. The main discrepancy is that pandas automatically classifies attributes as object where the website defines them as Text. There are also attributes classified as float64 that may hold only integer values. Specifically,

  • PermitNum, PermitClass, PermitClassMapped, PermitTypeMapped, PermitTypeDesc, Description were classified by pandas as object; the website classifies them as Text.
  • Columns 6-9 were classified by pandas as float64; the website classifies them as Number.
  • IssuedDate, ExpiresDate, CompletedDate were classified by pandas as object; the website classifies them as Floating Timestamp.
  • Columns 15-18 and 20-21 were classified by pandas as object; the website classifies them as Text.
  • Column 19 (OriginalZip) was classified by pandas as float64; the website defines it as Text, so I read it as object to behave the same as the other object columns that should be strings (see later).
  • Latitude and Longitude were classified on the website as Number.
  • Location1 was classified as Location on the website.
  • HousingUnits, HousingUnitsRemoved, HousingUnitsAdded: my checks show that these columns contain only integer values plus missing values, so they will be read as integers. A column with numbers and NaN values is automatically classified as float, and I cannot specify Int64 explicitly because pandas reports ValueError: Unable to parse string "1,014", so I convert those columns in a separate step.
  • The other columns, apart from Longitude and Latitude, are classified as object, which also supports strings, so I leave them as they are.

I decided to leave Location1 as an object.
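The separate conversion step for the housing-unit columns can be sketched as follows (assuming the thousands separators were already handled during reading):

```python
import numpy as np
import pandas as pd

# A float column holding only whole numbers plus NaN can be converted to
# pandas' nullable integer type in a separate step; a plain int64 would fail
# because NaN has no integer representation.
s = pd.Series([1014.0, np.nan, 22.0])
s_int = s.astype('Int64')  # the missing value is kept as <NA>
```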

Missing Data¶

Let's find out what data is missing. RelatedMup, HousingUnits, and ContractorCompanyName have >80% of values missing. RelatedMup is a related land use permit and may legitimately be absent.

The missing data may be a problem for a particular business question and should be addressed as needed.
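ut.count_missing is a helper from myutils; a minimal equivalent (hypothetical, the real signature may differ) could look like:

```python
import pandas as pd

def count_missing(df):
    """Per-column summary of missing values, similar to ut.count_missing."""
    total = len(df)
    missing = df.isna().sum()
    return pd.DataFrame({
        'Attribute': df.columns,
        '#NotNull': total - missing.values,
        '#Missing': missing.values,
        'Total': total,
        '%Missing': (missing.values / total * 100).round(2),
    })

demo = pd.DataFrame({'a': [1, None, 3], 'b': [None, None, 'x']})
summary = count_missing(demo)
```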

In [ ]:
ut.count_missing(df_org).sort_values(['%Missing'], ascending=False)
Out[ ]:
Attribute #NotNull #Missing Total %Missing
15 RelatedMup 13689 154329 168018 91.85
6 HousingUnits 26697 141321 168018 84.11
20 ContractorCompanyName 28358 139660 168018 83.12
7 HousingUnitsRemoved 79331 88687 168018 52.78
8 HousingUnitsAdded 79331 88687 168018 52.78
13 CompletedDate 84417 83601 168018 49.76
11 IssuedDate 122470 45548 168018 27.11
12 ExpiresDate 122626 45392 168018 27.02
10 AppliedDate 129287 38731 168018 23.05
9 EstProjectCost 133261 34757 168018 20.69
19 OriginalZip 157323 10695 168018 6.37
4 PermitTypeDesc 157825 10193 168018 6.07
17 OriginalCity 161380 6638 168018 3.95
18 OriginalState 161470 6548 168018 3.90
1 PermitClass 161578 6440 168018 3.83
2 PermitClassMapped 161578 6440 168018 3.83
22 Latitude 165229 2789 168018 1.66
23 Longitude 165229 2789 168018 1.66
24 Location1 165229 2789 168018 1.66
16 OriginalAddress1 167581 437 168018 0.26
5 Description 167731 287 168018 0.17
14 StatusCurrent 168018 0 168018 0.00
21 Link 168018 0 168018 0.00
3 PermitTypeMapped 168018 0 168018 0.00
0 PermitNum 168018 0 168018 0.00

Descriptive Statistics¶

The descriptive stats show some interesting facts that I will take a closer look at.

In [ ]:
df_org.describe(include='all')
Out[ ]:
PermitNum PermitClass PermitClassMapped PermitTypeMapped PermitTypeDesc Description HousingUnits HousingUnitsRemoved HousingUnitsAdded EstProjectCost AppliedDate IssuedDate ExpiresDate CompletedDate StatusCurrent RelatedMup OriginalAddress1 OriginalCity OriginalState OriginalZip ContractorCompanyName Link Latitude Longitude Location1
count 168018 161578 161578 168018 157825 167731 26697.000000 79331.000000 79331.000000 1.332610e+05 129287 122470 122626 84417 168018 13689 167581 161380 161470 157323 28358 168018 165229.000000 165229.000000 165229
unique 168017 6 2 5 18 141033 NaN NaN NaN NaN NaN NaN NaN NaN 24 7283 78687 6 2 46 5992 168017 NaN NaN 72949
top 6870287-CN Single Family/Duplex Residential Building Addition/Alteration Construct addition and alterations to existing single family residence, per plan. NaN NaN NaN NaN NaN NaN NaN NaN Completed 3012843-LU 701 5TH AVE SEATTLE WA 98103 U OF W BUILDING PERMIT https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6870287-CN NaN NaN (47.60454088, -122.33073506)
freq 2 95813 115278 135499 98111 390 NaN NaN NaN NaN NaN NaN NaN NaN 116817 32 429 161333 161469 11078 595 2 NaN NaN 428
mean NaN NaN NaN NaN NaN NaN 38.934000 0.209792 3.057103 1.275588e+06 2014-09-24 15:00:37.841391872 2014-10-29 00:25:35.122070784 2016-07-22 17:48:24.930438912 2014-07-29 06:30:20.533778944 NaN NaN NaN NaN NaN NaN NaN NaN 47.625378 -122.334198 NaN
min NaN NaN NaN NaN NaN NaN 0.000000 0.000000 0.000000 0.000000e+00 1986-04-28 00:00:00 1986-07-03 00:00:00 2001-10-24 00:00:00 2005-02-22 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN 47.495809 -122.430368 NaN
25% NaN NaN NaN NaN NaN NaN 0.000000 0.000000 0.000000 1.506400e+04 2010-04-02 00:00:00 2010-05-06 00:00:00 2012-01-19 00:00:00 2010-06-17 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN 47.582701 -122.363300 NaN
50% NaN NaN NaN NaN NaN NaN 0.000000 0.000000 0.000000 6.500000e+04 2015-03-20 00:00:00 2015-04-02 00:00:00 2016-12-08 00:00:00 2014-10-21 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN 47.627543 -122.333558 NaN
75% NaN NaN NaN NaN NaN NaN 1.000000 0.000000 1.000000 2.543950e+05 2019-01-16 00:00:00 2019-03-04 00:00:00 2020-12-04 00:00:00 2018-02-05 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN 47.670168 -122.304099 NaN
max NaN NaN NaN NaN NaN NaN 1000000.000000 272.000000 891.000000 2.500000e+10 2023-10-20 00:00:00 2023-10-20 00:00:00 2025-05-18 00:00:00 2023-10-20 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN 47.733972 -122.221100 NaN
std NaN NaN NaN NaN NaN NaN 6120.246962 2.129944 23.233911 9.736551e+07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.055536 0.038203 NaN

Values for Categorical Attributes¶

Below the values for categorical attributes are presented.

  • PermitTypeDesc: there are both NONE and <NA> values; there are 388 NONE values. I assume they are semantically equivalent to <NA>, so I decided to change the NONE values to <NA>.
  • OriginalCity: there are spelling variations of Seattle (including one misspelling). I settled on the SEATTLE spelling.
  • OriginalState: there are case variations of WA. I settled on the WA spelling.
In [ ]:
def print_uniq_vals(df_, cols_):
    '''
    Print the unique values of the specified columns.

    @param df_ A dataframe
    @param cols_ The list of columns whose unique values are printed
    @return None
    '''
    for c in cols_:
        print(f"\n---------- {c}: {df_[c].unique()}")

print("**** Unique values for specific observations:")
print_uniq_vals(df_org, ['PermitClass', 'PermitClassMapped', 'PermitTypeMapped', 
                'PermitTypeDesc', 'StatusCurrent', 'OriginalCity', 
                'OriginalState'])

def print_extra_info(df_):
    '''
    Print additional information to examine.

    @param df_ A dataframe
    @return None
    '''
    print(f"The number of 'NONE' values in PermitTypeDesc: {len(df_[df_['PermitTypeDesc'] == 'NONE'])}")


print("**** Extra information about data:")
print_extra_info(df_org)
**** Unique values for specific observations:

---------- PermitClass: ['Industrial' 'Vacant Land' 'Single Family/Duplex' 'Multifamily'
 'Commercial' 'Institutional' nan]

---------- PermitClassMapped: ['Non-Residential' 'Residential' nan]

---------- PermitTypeMapped: ['Building' 'Demolition' 'Grading' 'Roof'
 'ECA and Shoreline Exemption/Street Improvement Exception Request']

---------- PermitTypeDesc: ['New' 'Demolition' 'Addition/Alteration' 'Tenant Improvment' nan
 'Change of Use Only - No Construction' 'Deconstruction' 'Temporary'
 'Relocation' 'Curb Cut' 'NONE' 'Environmentally Critical Area Exemption'
 'Shoreline Exemption' 'Tenant Improvement Pre-Approval'
 'Street Improvement Exception' 'Relief from Prohibition on Steep Slope'
 'Modification to Submittal Requirements' 'Shoreline Permit Exemption'
 'ECA Small Project Waiver']

---------- StatusCurrent: ['Application Completed' 'Awaiting Information' 'Reviews In Process'
 'Ready for Issuance' 'Issued' 'Reviews Completed' 'Corrections Required'
 'Completed' 'Corrections Submitted' 'Withdrawn' 'Inspections Completed'
 'Denied' 'Additional Info Requested' 'Approved to Occupy' 'Expired'
 'Phase Issued' 'Closed' 'Scheduled' 'Canceled' 'Active' 'Initiated'
 'Ready for Intake' 'Scheduled and Submitted' 'Pending']

---------- OriginalCity: ['SEATTLE' nan 'Seattle' 'TUKWILA' 'KING COUNTY' 'seattle' 'SEATLLE']

---------- OriginalState: ['WA' nan 'wa']
**** Extra information about data:
The number of 'NONE' values in PermitTypeDesc: 388

The columns' unique values after corrections are presented below:

In [ ]:
# replace 'NONE' with '<NA>' in PermitTypeDesc
df_org.loc[ df_org['PermitTypeDesc'] == 'NONE', ['PermitTypeDesc']] = np.nan
#len(df_org[ df_org['PermitTypeDesc'] == 'NONE' ])

# make uppercase letters for the city
df_org['OriginalCity'] = df_org['OriginalCity'].str.upper()
# fix the spelling of Seattle
df_org.loc[ df_org['OriginalCity'] == 'SEATLLE', ['OriginalCity']] = 'SEATTLE'

# make uppercase letters for the state
df_org['OriginalState'] = df_org['OriginalState'].str.upper()

#print_uniq_vals(df_org, ['PermitTypeDesc', 'StatusCurrent', 'OriginalCity', 'OriginalState'])

print_uniq_vals(df_org, ['PermitClass', 'PermitClassMapped', 'PermitTypeMapped', 
                'PermitTypeDesc', 'StatusCurrent', 'OriginalCity', 
                'OriginalState'])
---------- PermitClass: ['Industrial' 'Vacant Land' 'Single Family/Duplex' 'Multifamily'
 'Commercial' 'Institutional' nan]

---------- PermitClassMapped: ['Non-Residential' 'Residential' nan]

---------- PermitTypeMapped: ['Building' 'Demolition' 'Grading' 'Roof'
 'ECA and Shoreline Exemption/Street Improvement Exception Request']

---------- PermitTypeDesc: ['New' 'Demolition' 'Addition/Alteration' 'Tenant Improvment' nan
 'Change of Use Only - No Construction' 'Deconstruction' 'Temporary'
 'Relocation' 'Curb Cut' 'Environmentally Critical Area Exemption'
 'Shoreline Exemption' 'Tenant Improvement Pre-Approval'
 'Street Improvement Exception' 'Relief from Prohibition on Steep Slope'
 'Modification to Submittal Requirements' 'Shoreline Permit Exemption'
 'ECA Small Project Waiver']

---------- StatusCurrent: ['Application Completed' 'Awaiting Information' 'Reviews In Process'
 'Ready for Issuance' 'Issued' 'Reviews Completed' 'Corrections Required'
 'Completed' 'Corrections Submitted' 'Withdrawn' 'Inspections Completed'
 'Denied' 'Additional Info Requested' 'Approved to Occupy' 'Expired'
 'Phase Issued' 'Closed' 'Scheduled' 'Canceled' 'Active' 'Initiated'
 'Ready for Intake' 'Scheduled and Submitted' 'Pending']

---------- OriginalCity: ['SEATTLE' nan 'TUKWILA' 'KING COUNTY']

---------- OriginalState: ['WA' nan]

Uppercase Letters¶

I changed OriginalAddress1 to uppercase letters:

In [ ]:
df_org['OriginalAddress1'] = df_org['OriginalAddress1'].str.upper()
df_org['OriginalAddress1'].describe()
Out[ ]:
count          167581
unique          76199
top       701 5TH AVE
freq              436
Name: OriginalAddress1, dtype: object

Duplicates: PermitNum¶

This attribute shows that one record is not unique: there is a duplicate PermitNum value. I examined the two rows that share it, indexes 11830 and 11831, and all other attributes have the same values as well. The two rows are identical, so one of them can be safely removed from the dataset.

PermitNum looks like the only identifier in this dataset, and it should be unique per record. No other fully duplicated rows were found.

In [ ]:
def print_dup(df_, col_):
    '''
    Print rows with duplicated values in a column.

    @param df_ A dataframe
    @param col_ The name of the column to analyze
    @return None
    '''
    #d_series = ut.find_duplicates(df_, [col_])
    d_series = df_.duplicated(subset=[col_])
    if d_series.any():
        # print the duplicates; for the unique rows do df_.loc[~d_series]
        print(df_.loc[d_series])
        
print_dup(df_org, 'PermitNum')
        PermitNum  PermitClass PermitClassMapped PermitTypeMapped  \
11831  6870287-CN  Multifamily       Residential         Building   

      PermitTypeDesc  \
11831            New   

                                                                                                                        Description  \
11831  Establish use as single family residence with 2 attached accessory dwelling units and construct 3-unit townhouse, per plans.   

       HousingUnits  HousingUnitsRemoved  HousingUnitsAdded  EstProjectCost  \
11831           NaN                  NaN                NaN        555755.0   

      AppliedDate IssuedDate ExpiresDate CompletedDate         StatusCurrent  \
11831  2021-12-08        NaT         NaT           NaT  Corrections Required   

       RelatedMup   OriginalAddress1 OriginalCity OriginalState OriginalZip  \
11831  3038111-LU  8661 BEACON AVE S          NaN            WA         NaN   

      ContractorCompanyName  \
11831                   NaN   

                                                                                    Link  \
11831  https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6870287-CN   

        Latitude   Longitude                     Location1  
11831  47.524625 -122.288077  (47.52462452, -122.28807667)  
In [ ]:
# rows with the duplicate PermitNum are identical
df_org.drop_duplicates(subset='PermitNum', keep='last', inplace=True)
df_org.describe(include='all')
Out[ ]:
PermitNum PermitClass PermitClassMapped PermitTypeMapped PermitTypeDesc Description HousingUnits HousingUnitsRemoved HousingUnitsAdded EstProjectCost AppliedDate IssuedDate ExpiresDate CompletedDate StatusCurrent RelatedMup OriginalAddress1 OriginalCity OriginalState OriginalZip ContractorCompanyName Link Latitude Longitude Location1
count 168017 161577 161577 168017 157436 167730 26697.000000 79331.000000 79331.000000 1.332600e+05 129286 122470 122626 84417 168017 13688 167580 161380 161469 157323 28358 168017 165228.000000 165228.000000 165228
unique 168017 6 2 5 17 141033 NaN NaN NaN NaN NaN NaN NaN NaN 24 7283 76199 3 1 46 5992 168017 NaN NaN 72949
top 6818110-CN Single Family/Duplex Residential Building Addition/Alteration Construct addition and alterations to existing single family residence, per plan. NaN NaN NaN NaN NaN NaN NaN NaN Completed 3012843-LU 701 5TH AVE SEATTLE WA 98103 U OF W BUILDING PERMIT https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6818110-CN NaN NaN (47.60454088, -122.33073506)
freq 1 95813 115277 135498 98111 390 NaN NaN NaN NaN NaN NaN NaN NaN 116817 32 436 161378 161469 11078 595 1 NaN NaN 428
mean NaN NaN NaN NaN NaN NaN 38.934000 0.209792 3.057103 1.275593e+06 2014-09-24 14:31:19.331095040 2014-10-29 00:25:35.122071040 2016-07-22 17:48:24.930439168 2014-07-29 06:30:20.533778688 NaN NaN NaN NaN NaN NaN NaN NaN 47.625379 -122.334198 NaN
min NaN NaN NaN NaN NaN NaN 0.000000 0.000000 0.000000 0.000000e+00 1986-04-28 00:00:00 1986-07-03 00:00:00 2001-10-24 00:00:00 2005-02-22 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN 47.495809 -122.430368 NaN
25% NaN NaN NaN NaN NaN NaN 0.000000 0.000000 0.000000 1.506050e+04 2010-04-02 00:00:00 2010-05-06 00:00:00 2012-01-19 00:00:00 2010-06-17 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN 47.582701 -122.363300 NaN
50% NaN NaN NaN NaN NaN NaN 0.000000 0.000000 0.000000 6.500000e+04 2015-03-20 00:00:00 2015-04-02 00:00:00 2016-12-08 00:00:00 2014-10-21 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN 47.627546 -122.333558 NaN
75% NaN NaN NaN NaN NaN NaN 1.000000 0.000000 1.000000 2.543732e+05 2019-01-16 00:00:00 2019-03-04 00:00:00 2020-12-04 00:00:00 2018-02-05 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN 47.670168 -122.304104 NaN
max NaN NaN NaN NaN NaN NaN 1000000.000000 272.000000 891.000000 2.500000e+10 2023-10-20 00:00:00 2023-10-20 00:00:00 2025-05-18 00:00:00 2023-10-20 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN 47.733972 -122.221100 NaN
std NaN NaN NaN NaN NaN NaN 6120.246962 2.129944 23.233911 9.736588e+07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.055535 0.038203 NaN
In [ ]:
# check if there are any duplicates
print(f"Are there any duplicates in dataframe? Duplicates found:\n{ut.are_there_duplicates(df_org)}")
Are there any duplicates in dataframe? Duplicates found:
Series([], dtype: bool)

Data Ranges¶

In this section, I take a broader look at the ranges of values in the columns without inspecting specific cells.

PermitNum and Description¶

Some Description observations have values that are not valid descriptions, such as ':', ';', '.', or 'test'.

In [ ]:
md(f"The attribute *PermitNum* is a string {df_org.PermitNum.str.len().min()}"
   f"-{df_org.PermitNum.str.len().max()} characters long.<BR>"
   f"The attribute *Description* is {df_org.Description.str.len().min()}"
   f"-{df_org.Description.str.len().max()} characters long.")
Out[ ]:

The attribute PermitNum is a string 10-14 characters long.
The attribute Description is 1.0-499.0 characters long.

In [ ]:
# a few examples of *Description*
print(df_org[df_org.Description.str.len() < 5]['Description'].head(10))
30830     Demo
58837     Demo
108327    T.I.
130093    test
130370    stfi
130439     ULS
132367    TEST
132865     TI.
132892      TI
133381    VOID
Name: Description, dtype: object
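One possible way to flag such placeholder descriptions is sketched below; this is an assumption on my part (the notebook itself leaves these values in place), and both the length threshold and the placeholder list are illustrative, not taken from the data source.

```python
import pandas as pd

# Hypothetical sketch: mark very short or known-placeholder descriptions as
# missing so they do not pollute later text analysis. The threshold and the
# placeholder set are assumptions, not part of the notebook's pipeline.
PLACEHOLDERS = {'test', 'void', 'stfi'}

def clean_description(s):
    lowered = s.str.strip().str.lower()
    bad = lowered.isin(PLACEHOLDERS) | (s.str.len() < 3)
    return s.mask(bad)  # NaN where the description looks invalid

demo = pd.Series(['Construct addition, per plan.', 'test', ';', 'Demo'])
cleaned = clean_description(demo)
```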

Numerical Observations¶

  • HousingUnits has a value of 1000000. According to Redfin this condo was built in 2019, and the number of units added was 22. I assume this is a mistake and assign the value 22, i.e., the number of added housing units.
  • EstProjectCost contains one $25B value.
In [ ]:
def correct_housing_units(df_):
    idx = df_[df_['PermitNum'] == '6409645-CN'].index
    df_.loc[idx, 'HousingUnits'] = df_.loc[idx, 'HousingUnitsAdded']

correct_housing_units(df_org)
print(df_org[df_org['HousingUnits'] > 10000])

#print(df_org[df_org['HousingUnits'] > 500])
#print(df_org[df_org['HousingUnitsAdded'] > 500])

#print(df_org[df_org['EstProjectCost'] > 10000000].loc[:, ['PermitNum', 'PermitClass', 'HousingUnits', 'EstProjectCost']])

df_org.describe(include=[float])
      # 'HousingUnitsRemoved','HousingUnitsAdded', 'EstProjectCost'].describe()
Empty DataFrame
Columns: [PermitNum, PermitClass, PermitClassMapped, PermitTypeMapped, PermitTypeDesc, Description, HousingUnits, HousingUnitsRemoved, HousingUnitsAdded, EstProjectCost, AppliedDate, IssuedDate, ExpiresDate, CompletedDate, StatusCurrent, RelatedMup, OriginalAddress1, OriginalCity, OriginalState, OriginalZip, ContractorCompanyName, Link, Latitude, Longitude, Location1]
Index: []
Out[ ]:
HousingUnits HousingUnitsRemoved HousingUnitsAdded EstProjectCost Latitude Longitude
count 26697.000000 79331.000000 79331.000000 1.332600e+05 165228.000000 165228.000000
mean 1.477432 0.209792 3.057103 1.275593e+06 47.625379 -122.334198
std 11.886870 2.129944 23.233911 9.736588e+07 0.055535 0.038203
min 0.000000 0.000000 0.000000 0.000000e+00 47.495809 -122.430368
25% 0.000000 0.000000 0.000000 1.506050e+04 47.582701 -122.363300
50% 0.000000 0.000000 0.000000 6.500000e+04 47.627546 -122.333558
75% 1.000000 0.000000 1.000000 2.543732e+05 47.670168 -122.304104
max 1014.000000 272.000000 891.000000 2.500000e+10 47.733972 -122.221100
In [ ]:
def print_info_EstProjectCost(df_, min_cost_):
    '''
    Print some info about the project cost.

    @param df_ The dataframe
    @param min_cost_ The minimum cost above which a project is considered expensive
    @return None
    '''
    labels = [ '<=100K', '(100K, 1M]', '(1M, 3M]', '(3M, 10M]', '>10M' ]
    df_['CostBinned'] = pd.cut(
        df_['EstProjectCost'],
        [0, 1.0e+05, 1.0e+06, 3.0e+06, 10.0e+06, df_['EstProjectCost'].max()],
        labels=labels,
        include_lowest=True, ordered=True)
    # df_['CostBinned'] = pd.qcut(df_['EstProjectCost'], q=5)

    # how many projects in each project cost category
    s = df_['CostBinned'].value_counts()
    
    cost_binned_df = pd.DataFrame({'CostBinned' : s.index, 'Count' : s.values})
    print("The table that shows projects grouped by cost and the number of projects in each group:")
    print(cost_binned_df)

    # create a color palette; colors on two pie charts are going
    # in reverse order, so make sure it is less confusing if we
    # have the same color for the same category
    color_map = {}
    for i in range(0, len(labels)):
        color_map[labels[i]] = px.colors.qualitative.Plotly[i]
    
    # without specifying 'color' the color_discrete_map is ignored
    px.pie(cost_binned_df, values='Count', names='CostBinned', 
    title='Project Counts per Estimated Project Cost', 
    color='CostBinned', # color is taken by taking a unique list of this column
    color_discrete_map=color_map,
    hole=0.3).update_traces(
        textposition='inside', textinfo='percent+label',        
    ).show()

    #np.digitize(df_['EstProjectCost'], bins=[100000, 1000000, 3000000,10000000])
    
    # specify 'color' property if you want 'color_discrete_map'
    px.pie(df_, values='EstProjectCost', names='CostBinned', 
           hole=0.3, color='CostBinned', color_discrete_map=color_map,
           title='Estimated Project Cost per Total Value in a Group in US dollars.').show()

    # show the most expensive projects
    df_most_expensive = df_.loc[df_.EstProjectCost >= min_cost_, 
            [ 'PermitNum', 'PermitClass', #'PermitClassMapped', #'HousingUnits', 
             'HousingUnitsAdded', 'EstProjectCost'
             #,  'StatusCurrent'
            ]]
    print(f"EstProjectCost >= {min_cost_}: {len(df_most_expensive)}.")
    print("Most expensive projects sorted in descending order: ")
    print(df_most_expensive.sort_values(by = 'EstProjectCost', ascending=False).head(15))

def boxplot_EstProjectCost(df_, max_cost_):
    '''
    Boxplot of EstProjectCost by permit class.

    @param df_ A dataframe
    @param max_cost_ The maximum cost to include

    @return A Plotly figure
    '''
    fig = px.box(df_[df_.EstProjectCost <= max_cost_],
                 x='PermitClass', y='EstProjectCost', points='all',
                 title=f"Estimated Project Cost with max cost <= {max_cost_} US dollars.")
    return fig

print_info_EstProjectCost(df_org, 0.5e+08)
The table that shows projects grouped by cost and the number of projects in each group:
   CostBinned  Count
0      <=100K  79021
1  (100K, 1M]  45935
2    (1M, 3M]   4358
3   (3M, 10M]   2007
4        >10M   1939
EstProjectCost >= 50000000.0: 475.
Most expensive projects sorted in descending order: 
         PermitNum    PermitClass  HousingUnitsAdded  EstProjectCost
167552  6973064-DM    Multifamily               52.0    2.500000e+10
167551  6973063-CN    Multifamily               52.0    2.500000e+10
166997  6961341-CN    Multifamily                NaN    1.560000e+09
155573  6830099-CN  Institutional                NaN    1.443000e+09
152116  6781193-CN    Multifamily                NaN    1.000000e+09
152117  6781194-DM    Multifamily                NaN    1.000000e+09
164007  6879594-EX  Institutional                NaN    5.000000e+08
33821   6694765-PH     Commercial                0.0    4.587680e+08
22117   6706025-PH     Commercial              891.0    3.706000e+08
49      6973996-PH    Multifamily                NaN    3.210309e+08
56609   6438159-PH     Commercial                0.0    2.922400e+08
167795  6977429-CN  Institutional                NaN    2.750000e+08
157280  6744090-CN     Commercial              400.0    2.750000e+08
157281  6744091-DM     Commercial              400.0    2.750000e+08
49910   6436184-PH     Commercial              208.0    2.709659e+08
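The CostBinned grouping shown above can be reproduced with pd.cut. This is a minimal sketch, not the notebook's actual cell: the bin edges are inferred from the labels in the table and are an assumption.

```python
import numpy as np
import pandas as pd

# One synthetic cost per bin, plus a missing value.
costs = pd.Series([5e4, 2.5e5, 2e6, 5e6, 2.5e10, np.nan], name='EstProjectCost')
# Bin edges inferred from the labels; pd.cut uses right-closed intervals by default.
bins = [-np.inf, 1e5, 1e6, 3e6, 1e7, np.inf]
labels = ['<=100K', '(100K, 1M]', '(1M, 3M]', '(3M, 10M]', '>10M']
binned = pd.cut(costs, bins=bins, labels=labels)
print(binned.value_counts().sort_index())
```

Missing costs stay NaN in the binned column, which matches the non-null counts seen later in the notebook.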
In [ ]:
boxplot_EstProjectCost(df_org, 0.1e+08).show()

Dates: AppliedDate, ExpiresDate, IssuedDate, CompletedDate¶

  • AppliedDate: [1986-04-28; 2023-10-20]
  • ExpiresDate: [2001-10-24; 2025-05-18]
  • IssuedDate: [1986-07-03; 2023-10-20]
  • CompletedDate: [2005-02-22; 2023-10-20]

There are 23-27% missing values for all dates except CompletedDate, which lacks almost 50% of its values.

In [ ]:
'''
    Prints info about the date columns in the dataframe
    @param df_ The data frame

    @return None
'''
def print_dates(df_):
    
    dates = df_[['AppliedDate', 'ExpiresDate', 'IssuedDate', 'CompletedDate']]
    print(ut.count_missing(dates))

    print(f"len(df_) = {len(df_)}, len(dates)={len(dates)}")
    print("Descriptive stats regarding the dates:")
    print(dates.describe())
    print(dates.head())
    
# show some info about the dates
print_dates(df_org)
       Attribute  #NotNull  #Missing   Total  %Missing
0    AppliedDate    129286     38731  168017     23.05
1    ExpiresDate    122626     45391  168017     27.02
2     IssuedDate    122470     45547  168017     27.11
3  CompletedDate     84417     83600  168017     49.76
len(df_) = 168017, len(dates)=168017
Descriptive stats regarding the dates:
                         AppliedDate                    ExpiresDate  \
count                         129286                         122626   
mean   2014-09-24 14:31:19.331095040  2016-07-22 17:48:24.930439168   
min              1986-04-28 00:00:00            2001-10-24 00:00:00   
25%              2010-04-02 00:00:00            2012-01-19 00:00:00   
50%              2015-03-20 00:00:00            2016-12-08 00:00:00   
75%              2019-01-16 00:00:00            2020-12-04 00:00:00   
max              2023-10-20 00:00:00            2025-05-18 00:00:00   

                          IssuedDate                  CompletedDate  
count                         122470                          84417  
mean   2014-10-29 00:25:35.122071040  2014-07-29 06:30:20.533778688  
min              1986-07-03 00:00:00            2005-02-22 00:00:00  
25%              2010-05-06 00:00:00            2010-06-17 00:00:00  
50%              2015-04-02 00:00:00            2014-10-21 00:00:00  
75%              2019-03-04 00:00:00            2018-02-05 00:00:00  
max              2023-10-20 00:00:00            2023-10-20 00:00:00  
  AppliedDate ExpiresDate IssuedDate CompletedDate
0  2023-10-20         NaT        NaT           NaT
1  2023-10-20         NaT        NaT           NaT
2  2023-10-20         NaT        NaT           NaT
3  2023-10-20         NaT        NaT           NaT
4  2023-10-20         NaT        NaT           NaT
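Beyond ranges and missing counts, one additional sanity check worth running on these columns is internal consistency, e.g. that IssuedDate never precedes AppliedDate. A minimal sketch on synthetic rows (this check is not part of the cleaning above):

```python
import pandas as pd

dates = pd.DataFrame({
    'AppliedDate': pd.to_datetime(['2020-01-10', '2021-03-01', None]),
    'IssuedDate':  pd.to_datetime(['2020-02-01', '2021-02-15', '2022-05-01']),
})
# NaT comparisons evaluate to False, so rows with a missing date are skipped automatically.
suspect = dates[dates['IssuedDate'] < dates['AppliedDate']]
print(f"Rows with IssuedDate before AppliedDate: {len(suspect)}")
```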
In [ ]:
md(f"<ul><li>The attribute <i>OriginalAddress1</i> is a string {df_org.OriginalAddress1.str.len().min()}"
   f"-{df_org.OriginalAddress1.str.len().max()} characters long.</li>"
   f"<li>The attribute <i>RelatedMup</i> is {df_org.RelatedMup.str.len().min()}"
   f"-{df_org.RelatedMup.str.len().max()} characters long.</li>"
   f"<li>The attribute <i>Link</i> is {df_org.Link.str.len().min()}"
   f"-{df_org.Link.str.len().max()} characters long"
   f"<li>The attribute <i>ContractorCompanyName</i> is {df_org.ContractorCompanyName.str.len().min()}"
   f"-{df_org.ContractorCompanyName.str.len().max()} characters long"   
   f"</ul>"
   )
Out[ ]:
  • The attribute OriginalAddress1 is a string 7.0-31.0 characters long.
  • The attribute RelatedMup is 10.0-34.0 characters long.
  • The attribute Link is 81-85 characters long.
  • The attribute ContractorCompanyName is 3.0-80.0 characters long.

Histograms for Select Columns¶

The histograms below give better insight into the data. We can see that OriginalCity and OriginalZip are worth a closer look.

In [ ]:
'''
    Show histogram

    @param df_ dataframe
    @param x_ a list of column names to plot
    @return None
'''
def show_hist(df_, x_):
    for x in x_:
        fig = px.histogram(df_, x = x, text_auto=True)
        fig.show()

'''
    Show box plots for the given columns in a dataframe
    @param df_ a dataframe
    @param x_ a list of column names to plot

    @return  Nothing
'''
def show_box_plots(df_, x_):
    for x in x_:
        fig = px.box(df_, x = x)
        fig.show()


show_hist(df_org, 
              ['PermitClass', 'PermitClassMapped', 'PermitTypeMapped', 
               'PermitTypeDesc', 'StatusCurrent', 'RelatedMup','OriginalCity', 
               'OriginalZip'])
In [ ]:
px.box(df_org, y = ['HousingUnitsRemoved', 'HousingUnitsAdded'], points='all')
#show_box_plots(df_org,
#               ['HousingUnits', 'HousingUnitsRemoved', 
#                'HousingUnitsAdded', 'EstProjectCost'])

OriginalZip and OriginalCity¶

There is only one permit for TUKWILA, whose StatusCurrent is Awaiting Information. If you search for the permit number 6915033-CN at Permit and Property Records and look at the documents Construction Permit in Record Snapshot.pdf and Geotechnical Report.pdf, the location is in Seattle, not Tukwila: 3201 S NORFOLK ST SEATTLE, WA 98118 and 3301 South Norfolk Street in Seattle, respectively. Since the two documents disagree on the exact project location, I use the address from the Construction Permit record.

There is also one permit for KING COUNTY, which was Withdrawn. If you search Permit and Property Records for the address 8819 S 116TH PL, it shows SEATTLE as the city of the project, and there is paperwork for the bathroom. So I changed the city from KING COUNTY to SEATTLE.

In [ ]:
ut.count_missing(df_org[['OriginalCity', 'OriginalZip']])
Out[ ]:
Attribute #NotNull #Missing Total %Missing
0 OriginalCity 161380 6637 168017 3.95
1 OriginalZip 157323 10694 168017 6.36
In [ ]:
df_org[df_org['OriginalCity'] == 'TUKWILA']
Out[ ]:
PermitNum PermitClass PermitClassMapped PermitTypeMapped PermitTypeDesc Description HousingUnits HousingUnitsRemoved HousingUnitsAdded EstProjectCost AppliedDate IssuedDate ExpiresDate CompletedDate StatusCurrent RelatedMup OriginalAddress1 OriginalCity OriginalState OriginalZip ContractorCompanyName Link Latitude Longitude Location1 CostBinned
5708 6915033-CN Commercial Non-Residential Building New Construct new commercial building (Prologis building 4), portion of building in City of Tukwila, occupy per plan. Mechanical is included. 0.0 0.0 0.0 1732632.0 2022-10-26 NaT NaT NaT Awaiting Information NaN 10300 EAST MARGINAL WAY S TUKWILA WA 98168 NaN https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6915033-CN 47.510472 -122.291367 (47.51047209, -122.29136723) (1M, 3M]
In [ ]:
df_org[df_org['OriginalCity'] == 'KING COUNTY']
Out[ ]:
PermitNum PermitClass PermitClassMapped PermitTypeMapped PermitTypeDesc Description HousingUnits HousingUnitsRemoved HousingUnitsAdded EstProjectCost AppliedDate IssuedDate ExpiresDate CompletedDate StatusCurrent RelatedMup OriginalAddress1 OriginalCity OriginalState OriginalZip ContractorCompanyName Link Latitude Longitude Location1 CostBinned
6258 6924436-CN Single Family/Duplex Residential Building Addition/Alteration Interior alterations to bathroom in SFR, subject to field inspection, STFI. NaN NaN NaN 10500.0 2022-09-26 2022-09-26 2024-03-26 NaT Withdrawn NaN 8819 S 116TH PL KING COUNTY WA 98178 NaN https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6924436-CN 47.498304 -122.2211 (47.49830434, -122.22109991) <=100K
In [ ]:
'''
    Get the index of the record
    @param df_ dataframe
    @param permitnum_ a string representing a permit number

    @return An index representing the row location of a permit number in a dataframe
'''
def get_idx(df_, permitnum_):
    return df_[df_['PermitNum'] == permitnum_].index

# correct the Tukwila address to the Seattle address:
permit = '6915033-CN'
idx = get_idx(df_org, permit)
df_org.loc[idx, ['OriginalAddress1', 'OriginalCity', 'OriginalZip']] = ['3201 S NORFOLK ST', 'SEATTLE', '98118']
print('After the correction of the TUKWILA record:')
df_org.loc[get_idx(df_org, permit), ['PermitNum', 'OriginalAddress1', 'OriginalCity', 'OriginalZip']]
After the correction of the TUKWILA record:
Out[ ]:
PermitNum OriginalAddress1 OriginalCity OriginalZip
5708 6915033-CN 3201 S NORFOLK ST SEATTLE 98118
In [ ]:
permit = '6924436-CN'
df_org.loc[get_idx(df_org, permit), 'OriginalCity'] = 'SEATTLE'
print("After correction of the OriginalCity of the KING COUNTY to SEATTLE:")
df_org.loc[get_idx(df_org, permit), ['PermitNum', 'OriginalAddress1', 'OriginalCity', 'OriginalZip']]
After correction of the OriginalCity of the KING COUNTY to SEATTLE:
Out[ ]:
PermitNum OriginalAddress1 OriginalCity OriginalZip
6258 6924436-CN 8819 S 116TH PL SEATTLE 98178
In [ ]:
md(f"<ul><li>The attribute <i>OriginalZip</i> is a string {df_org.OriginalZip.str.len().min()}"
   f"-{df_org.OriginalZip.str.len().max()} characters long.</li>"
   f"</ul>")
Out[ ]:
  • The attribute OriginalZip is a string 1.0-5.0 characters long.

OriginalZip contains 24 zip codes that are not 5 characters long. I looked each of them up manually and updated them. The address of the record:

  • 11718, 6873274-CN, 1615 23RD AVE,

was found not in Seattle but in Longview. The address 1516 23rd Ave, Seattle, WA 98122, however, does exist in Seattle, so I changed the record to this address, assuming the original contains a typo.

In [ ]:
'''
    Print information about the zip codes
    @param df_ dataframe
    @return df_masked.index the index of the masked dataframe (zip codes < 5 chars)
'''
def print_zip_info(df_):
    print(f"Print records with OriginalZip that has less than 5 characters:")
    mask = df_['OriginalZip'].str.len() < 5
    df_masked = df_[mask]
    print(df_masked[['PermitNum','OriginalAddress1', 'OriginalZip']])
    
    return df_[mask].index

'''
    Corrects the zip codes
    @param df_ dataframe
    @param idx_ index of records that need their zip code updated
    @return None (df_ is modified in place)
'''
def correct_zips(df_, idx_):

    zips = ['98107','98109','98199','98122','98101','98117','98117',
            '98119','98178','98144','98115','98122','98117','98117',
            '98105','98112','98112','98122','98117','98107','98107',
            '98134','98134','98118']
    assert len(idx_) == len(zips), 'The zip codes do not match the number of records to be corrected'
    # correct the address
    df_.loc[11718,'OriginalAddress1'] = '1516 23RD AVE'
    mask = df_['OriginalZip'].str.len() < 5

    for i in range(len(zips)):
        df_.loc[idx_[i], 'OriginalZip'] = zips[i]
    
    print("After correction the zip codes:")
    print(df_[mask][['PermitNum', 'OriginalAddress1', 'OriginalZip']])

# find the zip codes and correct them
zip_idx = print_zip_info(df_org)
#print(zip_idx)

correct_zips(df_org, zip_idx)
Records with OriginalZip shorter than 5 characters:
         PermitNum          OriginalAddress1 OriginalZip
440     6968761-CN       6049 SEAVIEW AVE NW           0
6210    6880555-DM             120 NEWTON ST          74
7480    6873706-CN         3217 W RUFFNER ST         100
7868    6883278-DM  324 LAKE WASHINGTON BLVD           0
7999    6882054-CN         1942 WESTLAKE AVE          35
9032    6880515-DM          8338 27TH AVE NW          90
10047   6880514-CN          8338 27TH AVE NW          90
11012   6879171-DM            1617 3RD AVE W          76
11273   6875557-CN          10440 67TH AVE S          92
11640   6873831-CN     711 M L KING JR WAY S          83
11709   6863817-CN           3222 NE 96TH ST          46
11718   6873274-CN             1615 23RD AVE          78
11895   6869481-CN          7753 26TH AVE NW          90
11898   6871773-CN          7755 26TH AVE NW          90
12488   6790584-DM           3815 4TH AVE NE          70
20519   6783372-CN           2346 43RD AVE E           0
151790  6783022-EX           2346 43RD AVE E           0
163426  6884721-CN              603 35TH AVE          78
163979  6878586-CN          7324 19TH AVE NW          90
165251  6944323-CN       5619 SEAVIEW AVE NW           0
165806  6930076-EX           3459 NW 54TH ST           0
165873  6932038-CN  1140 SW MASSACHUSETTS ST           0
165954  6933589-EX  1140 SW MASSACHUSETTS ST           0
167654  6974838-CN        9700 RAINIER AVE S           0
After correcting the zip codes:
         PermitNum          OriginalAddress1 OriginalZip
440     6968761-CN       6049 SEAVIEW AVE NW       98107
6210    6880555-DM             120 NEWTON ST       98109
7480    6873706-CN         3217 W RUFFNER ST       98199
7868    6883278-DM  324 LAKE WASHINGTON BLVD       98122
7999    6882054-CN         1942 WESTLAKE AVE       98101
9032    6880515-DM          8338 27TH AVE NW       98117
10047   6880514-CN          8338 27TH AVE NW       98117
11012   6879171-DM            1617 3RD AVE W       98119
11273   6875557-CN          10440 67TH AVE S       98178
11640   6873831-CN     711 M L KING JR WAY S       98144
11709   6863817-CN           3222 NE 96TH ST       98115
11718   6873274-CN             1516 23RD AVE       98122
11895   6869481-CN          7753 26TH AVE NW       98117
11898   6871773-CN          7755 26TH AVE NW       98117
12488   6790584-DM           3815 4TH AVE NE       98105
20519   6783372-CN           2346 43RD AVE E       98112
151790  6783022-EX           2346 43RD AVE E       98112
163426  6884721-CN              603 35TH AVE       98122
163979  6878586-CN          7324 19TH AVE NW       98117
165251  6944323-CN       5619 SEAVIEW AVE NW       98107
165806  6930076-EX           3459 NW 54TH ST       98107
165873  6932038-CN  1140 SW MASSACHUSETTS ST       98134
165954  6933589-EX  1140 SW MASSACHUSETTS ST       98134
167654  6974838-CN        9700 RAINIER AVE S       98118

Let's check whether the zip codes belong to the Seattle area. The zip codes for comparison were taken from Zipcodes-us.com. There are two zip codes that do not belong to Seattle: 98019 (Duvall, WA) and 98004 (Bellevue, WA). Both addresses can be found in Seattle, and neither exists within its listed zip code, so I change the zip codes to 98199 and 98115, respectively.

In [ ]:
df_org['OriginalZip'].info()
<class 'pandas.core.series.Series'>
Index: 168017 entries, 0 to 168017
Series name: OriginalZip
Non-Null Count   Dtype 
--------------   ----- 
157323 non-null  object
dtypes: object(1)
memory usage: 6.6+ MB
In [ ]:
'''
    Checks whether the zip codes belong to the Seattle area
    @param df_ the data frame
    @return True if the dataframe has only Seattle zip codes,
            False if it contains zip codes from outside Seattle
'''
def check_zipcode_area(df_):
    SEA_ZIPCODES = [98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98110, 98111, 98112, 98113, 98114, 98115, 98116, 98117, 98118, 98119, 98121, 98122, 98124, 98125, 98126, 98127, 98129, 98131, 98133, 98134, 98136, 98138, 98139, 98141, 98144, 98145, 98146, 98148, 98154, 98155, 98158, 98160, 98161, 98164, 98165, 98166, 98168, 98170, 98174, 98175, 98177, 98178, 98181, 98185, 98188, 98190, 98191, 98194, 98195, 98198, 98199]
    zcodes = set(pd.to_numeric(df_['OriginalZip'].dropna().unique()))

    if (zcodes.issubset(SEA_ZIPCODES)):
        print(f"Column has only Seattle zip codes: {zcodes}")
        return True
    else:
        diff = zcodes.difference(SEA_ZIPCODES)
        print(f"There are zip codes that are not Seattle zipcodes: {zcodes.difference(SEA_ZIPCODES)}")
        for c in diff:
            outsiders = df_[ df_['OriginalZip'] ==  c.astype(str)]
            print(f"\nZip code {c} count: {len(outsiders)}")
            print(f"The records for {c}\n{outsiders.head()}")
        return False

check_zipcode_area(df_org)
There are zip codes that are not Seattle zipcodes: {98019, 98004}

Zip code 98019 count: 1
The records for 98019
        PermitNum           PermitClass PermitClassMapped PermitTypeMapped  \
38131  6666330-CN  Single Family/Duplex       Residential         Building   

            PermitTypeDesc  \
38131  Addition/Alteration   

                                                                                                       Description  \
38131  Interior alteration (kitchen remodel) with non load bearing wall removal subject to field inspection( STFI)   

       HousingUnits  HousingUnitsRemoved  HousingUnitsAdded  EstProjectCost  \
38131           NaN                  NaN                NaN         45000.0   

      AppliedDate IssuedDate ExpiresDate CompletedDate StatusCurrent  \
38131  2018-05-02 2018-05-02  2019-11-02           NaT     Completed   

      RelatedMup OriginalAddress1 OriginalCity OriginalState OriginalZip  \
38131        NaN  2507 28TH AVE W      SEATTLE            WA       98019   

      ContractorCompanyName  \
38131                   NaN   

                                                                                    Link  \
38131  https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6666330-CN   

       Latitude   Longitude                     Location1 CostBinned  
38131  47.64147 -122.393351  (47.64147029, -122.39335124)     <=100K  

Zip code 98004 count: 1
The records for 98004
        PermitNum PermitClass PermitClassMapped PermitTypeMapped  \
38038  6667095-CN  Commercial   Non-Residential         Building   

            PermitTypeDesc                                   Description  \
38038  Addition/Alteration  Re-Roof commercial building for Mass & Mass.   

       HousingUnits  HousingUnitsRemoved  HousingUnitsAdded  EstProjectCost  \
38038           NaN                  NaN                NaN             0.0   

      AppliedDate IssuedDate ExpiresDate CompletedDate StatusCurrent  \
38038  2018-05-08 2018-05-08  2019-11-08           NaT       Expired   

      RelatedMup OriginalAddress1 OriginalCity OriginalState OriginalZip  \
38038        NaN  5844 NE 75TH ST      SEATTLE            WA       98004   

      ContractorCompanyName  \
38038                   NaN   

                                                                                    Link  \
38038  https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6667095-CN   

       Latitude  Longitude                     Location1 CostBinned  
38038  47.68342 -122.26404  (47.68341985, -122.26404013)     <=100K  
Out[ ]:
False
In [ ]:
# corrections of the zip codes to Seattle codes from Duvall, and Bellevue
df_org.at[38131, 'OriginalZip'] = '98199'
df_org.at[38038, 'OriginalZip'] = '98115'

# again check the zip code area
check_zipcode_area(df_org)
Column has only Seattle zip codes: {98177, 98178, 98195, 98199, 98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98112, 98115, 98116, 98117, 98118, 98119, 98121, 98122, 98124, 98125, 98126, 98133, 98134, 98136, 98144, 98146, 98154, 98155, 98164, 98168}
Out[ ]:
True
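Set membership catches out-of-area zip codes; a complementary format check flags values that are not exactly five digits, which is how the short codes corrected earlier would also surface. A minimal sketch on a synthetic series (not part of the notebook's own checks):

```python
import pandas as pd

zips = pd.Series(['98101', '0', '74', '98199', None], name='OriginalZip')
# Flag non-null values that are not exactly five digits.
malformed = zips[zips.notna() & ~zips.str.fullmatch(r'\d{5}', na=False)]
print(list(malformed))  # → ['0', '74']
```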

Latitude, Longitude, and Location1¶

According to the field descriptions, Location1 is a field that combines Latitude and Longitude. Let's check whether that is indeed the case. There is no discrepancy between Latitude and Longitude and the corresponding Location1: I found no differences whether the tolerance was set to 0.0001 or to 1e-09. The numbers in Latitude and Longitude appear to be displayed at a different precision than in Location1, but on closer examination the precision is the same.

In [ ]:
'''
    Checks the consistency of the geolocation

    @param df_ The dataframe
    @return None
'''
def check_geo(df_):
    df = df_[['Latitude', 'Longitude', 'Location1']].copy()
    # for checking get rid of the parens left and right and split on comma
    df[['Lat', 'Lon']] = df['Location1'].str.replace('(', '').str.replace(')','').str.split(',', n=1, expand=True)
    df['Lat'] = pd.to_numeric(df['Lat'])
    df['Lon'] = pd.to_numeric(df['Lon'])
    
    # take absolute differences, so negative deviations are also caught
    df['absLat'] = (df['Latitude'] - df['Lat']).abs()
    df['absLon'] = (df['Longitude'] - df['Lon']).abs()
    
    print(df.head())
    
    def error(df1_, error_):        
        x = df1_[ (df1_['absLat'] >= error_) | (df1_['absLon'] >= error_) ]
        print(f"Length of diff for Latitude and  Longitude and Location1 >= {error_} = {len(x)}")
    
    error(df, 0.0001)
    error(df, 0.000000001)    

check_geo(df_org)
    Latitude   Longitude                     Location1        Lat         Lon  \
0  47.663480 -122.371648  (47.66348025, -122.37164843)  47.663480 -122.371648   
1  47.511472 -122.248589  (47.51147177, -122.24858896)  47.511472 -122.248589   
2  47.684953 -122.362213  (47.68495344, -122.36221339)  47.684953 -122.362213   
3  47.558581 -122.266645  (47.55858123, -122.26664479)  47.558581 -122.266645   
4  47.610569 -122.342377  (47.61056886, -122.34237665)  47.610569 -122.342377   

   absLat  absLon  
0     0.0     0.0  
1     0.0     0.0  
2     0.0     0.0  
3     0.0     0.0  
4     0.0     0.0  
Length of diff for Latitude and  Longitude and Location1 >= 0.0001 = 0
Length of diff for Latitude and  Longitude and Location1 >= 1e-09 = 0

Now, let's look at whether the location values lie within the Seattle area. The visualization suggests some concern regarding the geolocations, as the coordinates span more than the area of Seattle itself. This needs to be taken into account when working with the data.

In [ ]:
# center of the map
MAP_CENTER_LAT = 47.5 
MAP_CENTER_LNG = -122.249

def plot_scatter_geo(df_, title_=""):
    
    maxLat = df_['Latitude'].max()
    minLat = df_['Latitude'].min()
    maxLon = df_['Longitude'].max()
    minLon = df_['Longitude'].min()

    bounds = pd.DataFrame({
        'Lat' : [ maxLat, maxLat, minLat, minLat ],
        'Lon' : [ minLon, maxLon, minLon, maxLon ]
    })

    print(f"The boundary of locations in a dataframe (max and min values).\n{bounds}")
    #print(df_)
    #f = px.scatter_geo(bounds, lat = 'Lat', lon = 'Lon', 
    #                   #center = dict(lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG),
    #                   scope='usa',
    #                   mapbox_style = MAP_STYLE,
    #                   title = title_)
    f = px.density_mapbox(bounds, lat = 'Lat', lon = 'Lon', zoom=9,
                          mapbox_style=MAP_STYLE, 
                          title = title_)
    f.update_geos(fitbounds="locations") 
    
    return f

plot_scatter_geo(df_org, "Boundaries of Latitude and Longitude in the dataset.").show()
The boundary of locations in a dataframe (max and min values).
         Lat         Lon
0  47.733972 -122.430368
1  47.733972 -122.221100
2  47.495809 -122.430368
3  47.495809 -122.221100
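The concern about the coordinate span can be quantified by counting points outside an approximate Seattle bounding box. A minimal sketch on synthetic points; the bounds below are rough assumptions for illustration, not official city limits:

```python
import pandas as pd

# Rough Seattle bounding box (an assumption, not official city limits).
LAT_MIN, LAT_MAX = 47.48, 47.74
LON_MIN, LON_MAX = -122.45, -122.22

pts = pd.DataFrame({
    'Latitude':  [47.61, 47.50, 48.10],
    'Longitude': [-122.33, -122.30, -122.30],
})
# A point is flagged if it falls outside the box in either dimension.
outside = pts[(pts['Latitude'] < LAT_MIN) | (pts['Latitude'] > LAT_MAX) |
              (pts['Longitude'] < LON_MIN) | (pts['Longitude'] > LON_MAX)]
print(f"Points outside the approximate Seattle box: {len(outside)}")
```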

Summary¶

At this point, I consider the data clean enough to be used for further processing.

The data comes from a credible source and is a Public Domain dataset.

The data cleanup included:

  • Unifying number formats,
  • Quantifying missing data,
  • Gaining preliminary insights into categorical and numerical values contained in this dataset,
  • Addressing duplicates,
  • Unifying categorical observations to uppercase letters,
  • Checking data ranges,
  • Correcting some values of OriginalZip and OriginalCity,
  • Verifying geolocation coordinates, i.e., range and consistency between Location1 and individual Latitude and Longitude,
  • Adding the field 'CostBinned' to categorize the estimated cost.
In [ ]:
# write to csv our clean data
result, mesg = ut.wrt_to_csv(df_org, DATA_PATH_PROCESSED)
print(f"Writing to file: {DATA_PATH_PROCESSED}.\nResult = {result}\nMessage={mesg}")
Writing to file: input/Building_Permits_20231024_clean.zip.
Result = False
Message=WARN: File exists in location: input/Building_Permits_20231024_clean.zip. No write ...

Dataset Usage¶

In [ ]:
md(f"The dataset after processing can be found in {DATA_PATH_PROCESSED}.")
Out[ ]:

The dataset after processing can be found in input/Building_Permits_20231024_clean.zip.

The data should be read as follows:

df = pd.read_csv(file_location, thousands=',',
                 parse_dates=['AppliedDate', 'IssuedDate', 'ExpiresDate', 'CompletedDate'],
                 dtype={'OriginalZip': 'object'})
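As the info() output further below shows, CostBinned is read back as object, because CSV does not preserve the category dtype. If the category dtype matters downstream, it can be restored after reading; a minimal sketch using an in-memory CSV:

```python
import io
import pandas as pd

# Simulate reading the cleaned CSV back; CSV does not preserve the category dtype.
csv = io.StringIO('EstProjectCost,CostBinned\n50000,<=100K\n2000000,"(1M, 3M]"\n')
df = pd.read_csv(csv)
print(df['CostBinned'].dtype)   # object after reading from CSV
df['CostBinned'] = df['CostBinned'].astype('category')
print(df['CostBinned'].dtype)   # category after restoring
```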

And the types of the current dataframe are as follows:

In [ ]:
df_org.info()
<class 'pandas.core.frame.DataFrame'>
Index: 168017 entries, 0 to 168017
Data columns (total 26 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   PermitNum              168017 non-null  object        
 1   PermitClass            161577 non-null  object        
 2   PermitClassMapped      161577 non-null  object        
 3   PermitTypeMapped       168017 non-null  object        
 4   PermitTypeDesc         157436 non-null  object        
 5   Description            167730 non-null  object        
 6   HousingUnits           26697 non-null   float64       
 7   HousingUnitsRemoved    79331 non-null   float64       
 8   HousingUnitsAdded      79331 non-null   float64       
 9   EstProjectCost         133260 non-null  float64       
 10  AppliedDate            129286 non-null  datetime64[ns]
 11  IssuedDate             122470 non-null  datetime64[ns]
 12  ExpiresDate            122626 non-null  datetime64[ns]
 13  CompletedDate          84417 non-null   datetime64[ns]
 14  StatusCurrent          168017 non-null  object        
 15  RelatedMup             13688 non-null   object        
 16  OriginalAddress1       167580 non-null  object        
 17  OriginalCity           161380 non-null  object        
 18  OriginalState          161469 non-null  object        
 19  OriginalZip            157323 non-null  object        
 20  ContractorCompanyName  28358 non-null   object        
 21  Link                   168017 non-null  object        
 22  Latitude               165228 non-null  float64       
 23  Longitude              165228 non-null  float64       
 24  Location1              165228 non-null  object        
 25  CostBinned             133260 non-null  category      
dtypes: category(1), datetime64[ns](4), float64(6), object(15)
memory usage: 37.5+ MB

And the dtypes of the dataframe after reading the cleaned, compressed file back from disk are as below:

In [ ]:
df_clean = ut.read_to_df(DATA_PATH_PROCESSED, False)
df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168017 entries, 0 to 168016
Data columns (total 26 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   PermitNum              168017 non-null  object        
 1   PermitClass            161577 non-null  object        
 2   PermitClassMapped      161577 non-null  object        
 3   PermitTypeMapped       168017 non-null  object        
 4   PermitTypeDesc         157436 non-null  object        
 5   Description            167730 non-null  object        
 6   HousingUnits           26697 non-null   float64       
 7   HousingUnitsRemoved    79331 non-null   float64       
 8   HousingUnitsAdded      79331 non-null   float64       
 9   EstProjectCost         133260 non-null  float64       
 10  AppliedDate            129286 non-null  datetime64[ns]
 11  IssuedDate             122470 non-null  datetime64[ns]
 12  ExpiresDate            122626 non-null  datetime64[ns]
 13  CompletedDate          84417 non-null   datetime64[ns]
 14  StatusCurrent          168017 non-null  object        
 15  RelatedMup             13688 non-null   object        
 16  OriginalAddress1       167580 non-null  object        
 17  OriginalCity           161380 non-null  object        
 18  OriginalState          161469 non-null  object        
 19  OriginalZip            157323 non-null  object        
 20  ContractorCompanyName  28358 non-null   object        
 21  Link                   168017 non-null  object        
 22  Latitude               165228 non-null  float64       
 23  Longitude              165228 non-null  float64       
 24  Location1              165228 non-null  object        
 25  CostBinned             133260 non-null  object        
dtypes: datetime64[ns](4), float64(6), object(16)
memory usage: 33.3+ MB

Tests¶

In [ ]:
'''
    Quick test for values in a dataframe

    @param df_ The dataframe

    @return None
'''
def test_values(df_):
    test_str = '\n***** TEST  *****\n'
   
    def test_passed(msg_='PASSED'):
        print(msg_)
        
    print(f'{test_str}There should be only names or "nan": ')
    print_uniq_vals(df_, ['PermitClass', 'PermitClassMapped', 'PermitTypeMapped', 
                         'PermitTypeDesc', 'StatusCurrent', 'OriginalCity', 
                         'OriginalState'])
    test_passed("ATTENTION: TEST RESULTS (ABOVE) NEED MANUAL VERIFICATION")
    
    # check uppercase letters in 'OriginalAddress1'
    print(f'{test_str}Checking whether OriginalAddress1 is in uppercase: ')
    assert (df_['OriginalAddress1'].str.upper().equals(df_['OriginalAddress1']))
    test_passed()
    
    # check if there are any duplicates
    print(f'{test_str}Checking duplicates in PermitNum...')
    assert(len(ut.are_there_duplicates(df_)) == 0)
    test_passed()
   
    # check correction of the number of housing units
    print(f"{test_str}Checking if HousingUnits column was corrected ...")
    assert (len(df_[df_['HousingUnits'] > 10000]) == 0)
    test_passed()

    # check if we have only SEATTLE locations
    print(f"{test_str}Checking if OriginalCity was corrected ...")
    assert (len(df_[df_['OriginalCity'].notnull() & (df_['OriginalCity'] != 'SEATTLE') ]) == 0)
    test_passed()
   
    # check the zip code area if they are only from Seattle
    print(f"{test_str}Checking if zip codes are from Seattle ...")
    assert (check_zipcode_area(df_) == True)
    test_passed()

    print(f"\n\n************ TESTS SUMMARY *****************")
    print(f"Check manually results of the first test.")
    print("All automatic tests: PASSED")

# performing final tests
print('Performing final tests on a cleaned dataframe....')
test_values(df_clean)
Performing final tests on a cleaned dataframe....

***** TEST  *****
There should be only names or "nan": 

---------- PermitClass: ['Industrial' 'Vacant Land' 'Single Family/Duplex' 'Multifamily'
 'Commercial' 'Institutional' nan]

---------- PermitClassMapped: ['Non-Residential' 'Residential' nan]

---------- PermitTypeMapped: ['Building' 'Demolition' 'Grading' 'Roof'
 'ECA and Shoreline Exemption/Street Improvement Exception Request']

---------- PermitTypeDesc: ['New' 'Demolition' 'Addition/Alteration' 'Tenant Improvment' nan
 'Change of Use Only - No Construction' 'Deconstruction' 'Temporary'
 'Relocation' 'Curb Cut' 'Environmentally Critical Area Exemption'
 'Shoreline Exemption' 'Tenant Improvement Pre-Approval'
 'Street Improvement Exception' 'Relief from Prohibition on Steep Slope'
 'Modification to Submittal Requirements' 'Shoreline Permit Exemption'
 'ECA Small Project Waiver']

---------- StatusCurrent: ['Application Completed' 'Awaiting Information' 'Reviews In Process'
 'Ready for Issuance' 'Issued' 'Reviews Completed' 'Corrections Required'
 'Completed' 'Corrections Submitted' 'Withdrawn' 'Inspections Completed'
 'Denied' 'Additional Info Requested' 'Approved to Occupy' 'Expired'
 'Phase Issued' 'Closed' 'Scheduled' 'Canceled' 'Active' 'Initiated'
 'Ready for Intake' 'Scheduled and Submitted' 'Pending']

---------- OriginalCity: ['SEATTLE' nan]

---------- OriginalState: ['WA' nan]
ATTENTION: TEST RESULTS (ABOVE) NEED MANUAL VERIFICATION

***** TEST  *****
Checking whether OriginalAddress1 is in uppercase: 
PASSED

***** TEST  *****
Checking duplicates in PermitNum...
PASSED

***** TEST  *****
Checking if HousingUnits column was corrected ...
PASSED

***** TEST  *****
Checking if OriginalCity was corrected ...
PASSED

***** TEST  *****
Checking if zip codes are from Seattle ...
Column has only Seattle zip codes: {98177, 98178, 98195, 98199, 98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98112, 98115, 98116, 98117, 98118, 98119, 98121, 98122, 98124, 98125, 98126, 98133, 98134, 98136, 98144, 98146, 98154, 98155, 98164, 98168}
PASSED


************ TESTS SUMMARY *****************
Check the results of the first test manually.
All automatic tests: PASSED
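The test cell above relies on two helpers, `are_there_duplicates` and `check_zipcode_area`. Their actual definitions live elsewhere in the notebook; the sketch below is an assumed minimal implementation based solely on how they are called in the tests, with the Seattle zip code set taken from the test output.

```python
import pandas as pd

# Assumed set of Seattle zip codes, copied from the test output above.
SEATTLE_ZIPS = {
    98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109,
    98112, 98115, 98116, 98117, 98118, 98119, 98121, 98122, 98124,
    98125, 98126, 98133, 98134, 98136, 98144, 98146, 98154, 98155,
    98164, 98168, 98177, 98178, 98195, 98199,
}

def are_there_duplicates(df, key='PermitNum'):
    """Return the rows whose key column value occurs more than once.

    The test asserts len(...) == 0, so returning the duplicated rows
    (possibly empty) matches that usage.
    """
    return df[df.duplicated(subset=key, keep=False)]

def check_zipcode_area(df, zip_col='OriginalZip'):
    """Return True when every non-null zip code belongs to Seattle.

    The column name 'OriginalZip' is an assumption; the dataset's
    actual column may be named differently.
    """
    zips = set(df[zip_col].dropna().astype(int))
    return zips.issubset(SEATTLE_ZIPS)
```

With these definitions, a frame containing a repeated `PermitNum` or a non-Seattle zip code fails the corresponding check, which is exactly what the assertions in `test_values` guard against.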

Exporting To HTML¶

To export the notebook and hide the code cells in the exported file, run:

$ jupyter nbconvert --to html --no-input file.ipynb

References¶

[1] Seattle Department of Construction and Inspections, Seattle Open Data – Building Permits. City of Seattle, 2023. [Online]. Available: https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr

[2] Permit and Property Records. [Online]. Available: https://web.seattle.gov/dpd/edms/